COMPLEXITY REGULARIZATION VIA LOCALIZED RANDOM PENALTIES

By Gabor Lugosi and Marten Wegkamp
Pompeu Fabra University and Florida State University 

In this article, model selection via penalized empirical loss minimization
in nonparametric classification problems is studied. Data-dependent
penalties are constructed, which are based on estimates of the complexity
of a small subclass of each model class, containing only those functions
with small empirical loss. The penalties are novel since those considered
in the literature are typically based on the entire model class. Oracle
inequalities using these penalties are established, and the advantage of
the new penalties over those based on the complexity of the whole model
class is demonstrated.

1. Introduction. In this article, we propose a new complexity-penalized
model selection method based on data-dependent penalties. We consider the
binary classification problem where, given a random observation $X \in \mathbb{R}^d$,
one has to predict $Y \in \{0,1\}$. A classifier or classification rule is a function
$f:\mathbb{R}^d \to \{0,1\}$, with loss

$$L(f) \stackrel{\mathrm{def}}{=} \mathbb{P}\{f(X) \neq Y\}.$$

A sample $\mathcal{D}_n = (X_1,Y_1),\ldots,(X_n,Y_n)$ of $n$ independent, identically distributed
(i.i.d.) pairs is available. Each pair $(X_i,Y_i)$ has the same distribution as
$(X,Y)$ and $\mathcal{D}_n$ is independent of $(X,Y)$. The statistician's task is to select a
classification rule $\hat f_n$ based on the data $\mathcal{D}_n$ such that the probability of error

$$L(\hat f_n) = \mathbb{P}\{\hat f_n(X) \neq Y \mid \mathcal{D}_n\}$$

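is small. The Bayes classifier

$$f^*(x) \stackrel{\mathrm{def}}{=} \mathbb{I}\{\mathbb{P}[Y=1 \mid X=x] > \mathbb{P}[Y=0 \mid X=x]\}$$

(where $\mathbb{I}$ denotes the indicator function) is the optimal rule as

$$L^* \stackrel{\mathrm{def}}{=} \inf_{f:\mathbb{R}^d \to \{0,1\}} L(f) = L(f^*),$$

but both $f^*$ and $L^*$ are unknown to the statistician. In this article, we study
classifiers $f:\mathbb{R}^d \to \{0,1\}$ which minimize the empirical loss

$$\hat L(f) = \frac{1}{n}\sum_{i=1}^n \mathbb{I}\{f(X_i) \neq Y_i\}$$

over a class of rules $\mathcal{F}$. For any $\hat f \in \mathcal{F}$ minimizing the empirical probability
of error, we have

$$\begin{aligned}
\mathbb{E}L(\hat f) - L^* &= \mathbb{E}\hat L(\hat f) - L^* + \mathbb{E}(L-\hat L)(\hat f) \\
&= \mathbb{E}\inf_{f\in\mathcal{F}}\hat L(f) - L^* + \mathbb{E}(L-\hat L)(\hat f) \\
&\le \inf_{f\in\mathcal{F}}\mathbb{E}\hat L(f) - L^* + \mathbb{E}(L-\hat L)(\hat f) \\
&= \inf_{f\in\mathcal{F}}L(f) - L^* + \mathbb{E}(L-\hat L)(\hat f).
\end{aligned}$$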

Clearly, the approximation error

$$\inf_{f\in\mathcal{F}} L(f) - L^*$$

is decreasing as $\mathcal{F}$ becomes richer. However, the more complex $\mathcal{F}$, the more
difficult the statistical problem becomes: the estimation error

$$\mathbb{E}(L-\hat L)(\hat f)$$

increases with the complexity of $\mathcal{F}$. In many approaches to the problem
described above, one fixes in advance a sequence of model classes $\mathcal{F}_1, \mathcal{F}_2, \ldots$
whose union is $\mathcal{F}$. Denote by $\hat f_k$ a function in $\mathcal{F}_k$ having minimal empirical
loss and by $L_k^* = \inf_{f\in\mathcal{F}_k} L(f)$ the minimal loss in class $\mathcal{F}_k$. The problem of
penalized model selection is to find a possibly data-dependent penalty $\hat C_k$,
assigned to each class $\mathcal{F}_k$, such that minimizing the penalized empirical loss

$$\hat L(f) + \hat C_k, \qquad f\in\mathcal{F}_k,\ k=1,2,\ldots,$$

leads to a prediction rule

$$\tilde f = \hat f_{\hat k}, \qquad \text{where } \hat k \stackrel{\mathrm{def}}{=} \arg\min_{k\ge 1}\big(\hat L(\hat f_k) + \hat C_k\big),$$

with smallest possible loss.
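
The selection rule above can be sketched in a few lines. The following is a minimal illustration, not the procedure analyzed in this paper: it assumes one-dimensional data on $[0,1)$, takes $\mathcal{F}_k$ to be the classifiers that are constant on each of $k$ equal-width bins (so that an empirical minimizer $\hat f_k$ is a per-bin majority vote), and accepts the penalty as a user-supplied function of $k$ and $n$. The names `erm_histogram` and `select_model` are hypothetical.

```python
import math


def erm_histogram(xs, ys, k):
    """Empirical loss minimizer over classifiers constant on k equal bins of [0, 1).

    Returns the per-bin labels and the empirical loss of that classifier.
    """
    ones = [0] * k    # number of points with label 1 in each bin
    total = [0] * k   # number of points in each bin
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)
        total[b] += 1
        ones[b] += y
    # Majority vote per bin minimizes the number of misclassified points.
    labels = [1 if 2 * ones[b] > total[b] else 0 for b in range(k)]
    errors = sum(min(ones[b], total[b] - ones[b]) for b in range(k))
    return labels, errors / len(xs)


def select_model(xs, ys, max_k, penalty):
    """Return (score, k, labels, empirical loss) minimizing empirical loss + penalty."""
    best = None
    for k in range(1, max_k + 1):
        labels, emp_loss = erm_histogram(xs, ys, k)
        score = emp_loss + penalty(k, len(xs))
        if best is None or score < best[0]:
            best = (score, k, labels, emp_loss)
    return best
```

In this sketch `penalty` depends only on $(k, n)$; a data-dependent penalty would additionally need access to the sample.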

The main idea is that since $\hat f_k$ minimizes $\hat L(f)$ over $f\in\mathcal{F}_k$, we find, by
the argument described above, that

$$\mathbb{E}L(\hat f_k) - L^* \le L_k^* - L^* + \mathbb{E}(L-\hat L)(\hat f_k).$$

Our goal is to find the class $\mathcal{F}_k$ such that $L(\hat f_k)$ is as small as possible.
To this end, a good balance has to be found between the approximation
and estimation errors. The approximation error is unknown to us, but the
estimation error may be estimated. The key to complexity-regularized model
selection is that a tight bound for the estimation error is a good penalty $\hat C_k$.
More precisely, we show in Lemma 2.1 that if, for some constant $\gamma > 0$,

$$\mathbb{P}\{\hat C_k \le (L-\hat L)(\hat f_k)\} \le \frac{\gamma}{n^2k^2},$$

then the oracle inequality

$$\mathbb{E}L(\tilde f) - L^* \le \inf_k\,(L_k^* - L^* + \mathbb{E}\hat C_k) + 2\gamma n^{-2}$$

holds, and also a similar bound,

$$L(\tilde f) - L^* \le \inf_k\,(L_k^* - L^* + 2\hat C_k),$$

holds with probability greater than $1 - 4\gamma n^{-2}$. This simple result shows
that the penalty should be, with large probability, an upper bound on the
estimation error, and to guarantee good performance the bound should be
as tight as possible.

Originally, distribution-free bounds, based on uniform-deviation inequalities,
were proposed as penalties. For example, the structural risk minimization
method of Vapnik and Chervonenkis [27] uses penalties of the form

$$\hat C_k = \gamma\sqrt{\frac{\log \mathbb{S}_k(2n) + \log k}{n}},$$

where $\gamma$ is a constant and $\mathbb{S}_k(2n)$ is the $2n$-maximal shatter coefficient of
the class

$$\mathcal{A}_k = \{\{x: f(x)=1\},\ f\in\mathcal{F}_k\},$$

that is,

$$\begin{aligned}
\mathbb{S}_k(2n) &= \max_{x_1,\ldots,x_{2n}} |\{\{x_1,\ldots,x_{2n}\}\cap A,\ A\in\mathcal{A}_k\}| \\
&= \max_{x_1,\ldots,x_{2n}} |\{(f(x_1),\ldots,f(x_{2n})),\ f\in\mathcal{F}_k\}|;
\end{aligned} \tag{1.1}$$

see, for example, [9, 26]. The fact that this type of penalty works follows
from the Vapnik-Chervonenkis inequality. Such distribution-free bounds
are attractive because of their simplicity, but precisely because of their
distribution-free nature they are necessarily loose in many cases.
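
To make the size of such a penalty concrete, the sketch below evaluates it numerically. It relies on an assumption not made in the display above: $\log\mathbb{S}_k(2n)$ is replaced by the standard upper bound $V_k\log(2n+1)$, where $V_k$ is the VC dimension of $\mathcal{A}_k$, and the constant $\gamma$ defaults to 1 purely for illustration. The function name `srm_penalty` is hypothetical.

```python
import math


def srm_penalty(n, k, vc_dim, gamma=1.0):
    """Distribution-free penalty gamma * sqrt((log S_k(2n) + log k) / n).

    log S_k(2n) is replaced by the upper bound vc_dim * log(2n + 1),
    so the value returned is an upper bound on the penalty itself.
    """
    log_shatter_bound = vc_dim * math.log(2 * n + 1)
    return gamma * math.sqrt((log_shatter_bound + math.log(k)) / n)
```

The value depends on the data only through $n$, which is exactly why it cannot adapt to favorable distributions.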

Recently, various attempts have been made to define the penalties in a
data-dependent way to achieve this goal; see, for example, [2, 11, 13, 15, 17,
19, 22].
For example, in [2] and [11] random complexity penalties based on Rademacher
averages were proposed and investigated. Rademacher averages are
defined as

$$\hat R_{\mathcal{F}_k} = \mathbb{E}\left[\sup_{f\in\mathcal{F}_k}\frac{1}{n}\sum_{i=1}^n \sigma_i\,\mathbb{I}\{f(X_i)\neq Y_i\}\ \Big|\ \mathcal{D}_n\right],$$

where $\sigma_1,\ldots,\sigma_n$ are i.i.d. symmetric $\{-1,1\}$-valued random variables
independent of $\mathcal{D}_n$. This penalty was introduced because

$$\mathbb{E}\sup_{f\in\mathcal{F}_k}(L-\hat L)(f) \asymp \mathbb{E}\hat R_{\mathcal{F}_k}$$

(see, e.g., [25]), and because $\hat R_{\mathcal{F}_k}$ can be shown to be sharply concentrated
around its mean. In fact, concentration inequalities have been a key tool in
the analysis of data-based penalties (see [19]) and this paper relies heavily
on some recent concentration results.
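
Since the Rademacher average is a conditional expectation over the random signs only, it can be approximated from the data by simulation. The sketch below is one hedged way to do so, under the assumption that the class is represented by a finite list of loss patterns $(\mathbb{I}\{f(X_i)\neq Y_i\})_{i\le n}$, one per distinct behavior of the class on the sample; the expectation over the signs is replaced by a Monte Carlo average. The name `rademacher_average` and its parameters are hypothetical.

```python
import random


def rademacher_average(loss_vectors, num_draws=2000, seed=0):
    """Monte Carlo estimate of E[ sup_f (1/n) sum_i sigma_i * loss_f(i) | data ].

    loss_vectors: list of 0/1 sequences of equal length n, one per classifier.
    """
    rng = random.Random(seed)
    n = len(loss_vectors[0])
    total = 0.0
    for _ in range(num_draws):
        # One draw of i.i.d. symmetric signs.
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        # Supremum over the (finite) class of the signed average loss.
        total += max(sum(s * l for s, l in zip(sigma, v)) for v in loss_vectors) / n
    return total / num_draws
```

For a fixed seed the estimate can only grow when patterns are added to the list, which mirrors the monotonicity of the Rademacher average in the class.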

The model selection method based on Rademacher complexities satisfies
an oracle inequality of the rough form

$$\mathbb{E}L(\tilde f) - L^* \le \inf_k\left[L_k^* - L^* + \gamma_1\mathbb{E}\hat R_{\mathcal{F}_k} + \gamma_2\sqrt{\frac{\log k}{n}}\right] \tag{1.2}$$

(see [2] and [11]) for values of the constants $\gamma_1,\gamma_2 > 0$. The advantage of this
bound over the one obtained by the distribution-free penalties mentioned
above may perhaps be better understood if we further bound

$$\mathbb{E}\hat R_{\mathcal{F}_k} \le \sqrt{\frac{\mathbb{E}\log 2\mathbb{S}_k(X_1^n)}{2n}},$$

where

$$\begin{aligned}
\mathbb{S}_k(X_1^n) &= |\{\{X_1,\ldots,X_n\}\cap A:\ A=\{x:f(x)=1\},\ f\in\mathcal{F}_k\}| \\
&= |\{(f(X_1),\ldots,f(X_n)),\ f\in\mathcal{F}_k\}|
\end{aligned} \tag{1.3}$$

is the random shatter coefficient of the class $\mathcal{F}_k$, which obviously never
exceeds the worst-case shatter coefficient $\mathbb{S}_k(n)$ and may be significantly
smaller for certain distributions.
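
As a small illustration of the gap between the random and the worst-case shatter coefficient, the sketch below counts distinct label vectors for a finite family of threshold rules $f_t(x) = \mathbb{I}\{x \ge t\}$ on the real line. The family, the grid of thresholds and both function names are assumptions made for the example only; for thresholds the worst case over $n$ points is $n+1$, and repeated sample points give fewer labelings.

```python
def empirical_shatter_coefficient(classifiers, xs):
    """Number of distinct label vectors (f(x_1), ..., f(x_n)) over the classifiers."""
    return len({tuple(f(x) for x in xs) for f in classifiers})


def threshold_rules(grid_size=20):
    """Finite family f_t(x) = 1 if x >= t else 0, with t on a uniform grid."""
    return [
        (lambda x, t=i / grid_size: 1 if x >= t else 0)
        for i in range(grid_size + 2)
    ]
```

For example, `empirical_shatter_coefficient(threshold_rules(), [0.1, 0.4, 0.4, 0.9])` counts the labelings realized on a sample with one repeated point.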

However, this improved penalty is still not completely satisfactory. To see
this, recall that by a classical result of Vapnik and Chervonenkis, for any
index $k$,

$$\mathbb{E}L(\hat f_k) - L_k^* \le c\left(\sqrt{\frac{L_k^*\,\mathbb{E}\log\mathbb{S}_k(X_1^n)}{n}} + \frac{\mathbb{E}\log\mathbb{S}_k(X_1^n)}{n}\right), \tag{1.4}$$

which is much smaller than the corresponding expected Rademacher average
if $L_k^*$ is small. (For explicit constants we refer to Theorem 1.14 in [16].) Since
in typical classification problems the minimal error $L_k^*$ in class $\mathcal{F}_k$ is often
very small for some $k$, it is important to find penalties which allow derivation
of oracle inequalities with the appropriate dependence on $L_k^*$. In particular, a
desirable goal would be to develop classifiers $\tilde f$ for which an oracle inequality
resembling

$$\mathbb{E}L(\tilde f) - L^* \le \inf_k\left\{L_k^* - L^* + c\left(\sqrt{\frac{L_k^*\,\mathbb{E}\log\mathbb{S}_k(X_1^n)}{n}} + \frac{\mathbb{E}\log\mathbb{S}_k(X_1^n)}{n}\right)\right\}$$

holds for all distributions. The main results of this article (Theorems 4.1 and 4.2)
show that estimates of the desired property are indeed possible to construct
in a conceptually simple way.
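
The difference in order of magnitude can be seen numerically. The sketch below compares a rate of the form $\sqrt{\log\mathbb{S}/n}$, typical of penalties based on the whole class, with the form appearing on the right-hand side of (1.4). All constants are set to 1 and `log_shatter` stands in for $\mathbb{E}\log\mathbb{S}_k(X_1^n)$; both simplifications, and the function names, are assumptions made for illustration.

```python
import math


def global_rate(log_shatter, n):
    """Order of magnitude sqrt(log S / n), ignoring constants."""
    return math.sqrt(log_shatter / n)


def localized_rate(min_loss, log_shatter, n):
    """Order of magnitude sqrt(L* log S / n) + log S / n, ignoring constants."""
    return math.sqrt(min_loss * log_shatter / n) + log_shatter / n
```

With $n = 10000$ and `log_shatter = 50`, the first rate is about $0.07$, while the second drops to $0.005$ when the minimal loss is zero.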

By the key Lemma 2.1, it suffices to find a data-dependent upper estimate
of $(L-\hat L)(\hat f_k)$ which has the order of magnitude of the above upper bound.
The difficulty is that $L_k^*$ and $\mathbb{E}\log\mathbb{S}_k(X_1^n)$ both depend on the underlying
distribution.

The improvement is achieved by decreasing the penalties so that the
supremum in the definition of the Rademacher average is not taken over
the whole class $\mathcal{F}_k$ but rather over a small subclass $\hat{\mathcal{F}}_k$ containing only
functions which "look good" on the data. More precisely, define the random
subclass $\hat{\mathcal{F}}_k \subset \mathcal{F}_k$ by

$$\hat{\mathcal{F}}_k = \{f\in\mathcal{F}_k:\ \hat L(f) \le \gamma_1\hat L(\hat f_k) + \gamma_2 n^{-1}\log\mathbb{S}_k(X_1^n) + \gamma_3 n^{-1}\log(nk)\}$$

for some nonnegative constants $\gamma_1$, $\gamma_2$ and $\gamma_3$.
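
The localization step is a simple filter on empirical losses. The sketch below assumes, for illustration only, that the class is represented by the finite set of 0/1 loss patterns it realizes on the sample; the number of distinct patterns then equals the random shatter coefficient $\mathbb{S}_k(X_1^n)$, because each loss pattern corresponds to exactly one label vector. The function name and argument names are hypothetical.

```python
import math


def localized_subclass(loss_vectors, k, gamma1, gamma2, gamma3):
    """Keep the loss patterns whose empirical loss is below the localization level.

    Level: gamma1 * (min empirical loss) + gamma2 * log(S)/n + gamma3 * log(n k)/n,
    where S is the number of distinct patterns realized on the sample.
    """
    n = len(loss_vectors[0])
    patterns = sorted(set(map(tuple, loss_vectors)))
    emp = {v: sum(v) / n for v in patterns}
    min_loss = min(emp.values())
    threshold = (gamma1 * min_loss
                 + gamma2 * math.log(len(patterns)) / n
                 + gamma3 * math.log(n * k) / n)
    return [v for v in patterns if emp[v] <= threshold]
```

The empirical minimizer always survives the filter whenever $\gamma_1 \ge 1$, so the subclass is never empty in that case.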

Risk estimates based on localized Rademacher averages have been considered
in several recent papers. The most closely related procedure is proposed
by Koltchinskii and Panchenko [12], who, assuming $\inf_{f\in\mathcal{F}} L(f) = 0$,
compute the Rademacher averages of subclasses of $\mathcal{F}$ with empirical loss
less than $r$ for different values of $r$ obtained by a recursive procedure, and
obtain bounds for the loss of the empirical risk minimizer in terms of the
localized Rademacher averages obtained after a certain number of iterations.
Our approach of bounding the loss is conceptually simpler: it suffices to
compute the Rademacher complexities at only one scale which depends on
the smallest empirical loss in the class and a term of a smaller order
determined by the shatter coefficients of the whole class. Thus, we use "global"
information to determine the scale of localization. Bartlett, Bousquet and
Mendelson [3] also derive closely related generalization bounds based on
localized Rademacher averages. In their approach the performance bounds
also depend on Rademacher averages computed at different scales of
localization, which are combined by the technique of peeling. For further recent
related work, we also refer to [7, 8, 24].

The rest of the paper is organized as follows. Section 2 presents some basic
inequalities on model selection, which generalize some of the results in [2].
Section 3 proposes a simple but suboptimal penalty which already has some
of the main features of the penalties presented in Section 4. It shows, in a
transparent way, some of the underlying ideas of the main results. Section 4
introduces a new penalty based on the Rademacher average and it is
shown that the new estimate yields an improvement of the desired form.



2. Preliminaries. In this section we present two basic auxiliary lemmata
on model selection. The first lemma is general in the sense that it does not
depend on the particular choice of the penalty $\hat C_k$. This result was mentioned
in the Introduction and generalizes a result obtained by Bartlett, Boucheron
and Lugosi [2]. We recall that the penalized estimator is defined by $\tilde f = \hat f_{\hat k}$,
with $\hat k \stackrel{\mathrm{def}}{=} \arg\min_{k\ge 1}(\hat L(\hat f_k) + \hat C_k)$.

Lemma 2.1. Suppose that the random variables $\hat C_1, \hat C_2, \ldots$ are such that

$$\mathbb{P}\{\hat C_k \le (L-\hat L)(\hat f_k)\} \le \frac{\gamma}{n^2k^2}$$

for some $\gamma > 0$ and for all $k$. Then we have

$$\mathbb{E}L(\tilde f) - L^* \le \inf_k\,[L_k^* - L^* + \mathbb{E}\hat C_k] + \frac{2\gamma}{n^2}.$$

It is clear that we can always take $\hat C_k \le 1$.

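Proof of Lemma 2.1. Observe that

$$\begin{aligned}
\mathbb{E}\sup_k\{(L-\hat L)(\hat f_k) - \hat C_k\}
&\le \mathbb{P}\Big(\sup_k\,[(L-\hat L)(\hat f_k) - \hat C_k] > 0\Big) && (\text{since } \sup_k\,[(L-\hat L)(\hat f_k) - \hat C_k] \le 1) \\
&\le \sum_{k=1}^{\infty}\mathbb{P}\{(L-\hat L)(\hat f_k) - \hat C_k > 0\} && (\text{by the union bound}) \\
&\le \sum_{k=1}^{\infty}\frac{\gamma}{n^2k^2} && (\text{by assumption}) \\
&\le \frac{2\gamma}{n^2} && (\text{since } \textstyle\sum_{k\ge 1}k^{-2} = \pi^2/6 \le 2).
\end{aligned}$$

Therefore, we may conclude that

$$\begin{aligned}
\mathbb{E}L(\tilde f) - L^*
&= \mathbb{E}[\hat L(\tilde f) - L^* + \hat C_{\hat k}] + \mathbb{E}[(L-\hat L)(\tilde f) - \hat C_{\hat k}] \\
&\qquad (\text{where } \hat k \text{ is the selected model index, i.e., } \tilde f = \hat f_{\hat k}) \\
&\le \mathbb{E}\inf_k\,[\hat L(\hat f_k) - L^* + \hat C_k] + \mathbb{E}[(L-\hat L)(\tilde f) - \hat C_{\hat k}] \\
&\qquad (\text{by definition of } \tilde f) \\
&\le \mathbb{E}\inf_k\Big[\inf_{f\in\mathcal{F}_k}\hat L(f) - L^* + \hat C_k\Big] + \mathbb{E}\sup_k\,[(L-\hat L)(\hat f_k) - \hat C_k] \\
&\qquad (\text{by definition of } \hat f_k) \\
&\le \inf_k\Big[\inf_{f\in\mathcal{F}_k}L(f) - L^* + \mathbb{E}\hat C_k\Big] + \mathbb{E}\sup_k\,[(L-\hat L)(\hat f_k) - \hat C_k] \\
&\qquad (\text{interchange } \mathbb{E} \text{ and } \inf) \\
&\le \inf_k\,[L_k^* - L^* + \mathbb{E}\hat C_k] + \frac{2\gamma}{n^2} \\
&\qquad (\text{by the preceding display})
\end{aligned}$$

and the proof is complete. □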

The preceding result is not entirely satisfactory for the following reason.
Although it presents a useful bound, it is a bound for the average risk
behavior of $\tilde f$. However, the penalty is computed on the data at hand, and
therefore the proposed criterion should have optimal performance for
(almost) all possible sequences of the data. The following result presents a
nonasymptotic oracle inequality which holds with large probability and an
asymptotic almost-sure version.



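Lemma 2.2. Assume that, for all $k, n \ge 1$,

$$\mathbb{P}\{\hat C_k \le (L-\hat L)(\hat f_k)\} \le \frac{\gamma}{n^2k^2}$$

and

$$\mathbb{P}\{\hat C_k \le (\hat L-L)(f_k^*)\} \le \frac{\gamma}{n^2k^2},$$

where $f_k^*$ denotes a function in $\mathcal{F}_k$ with $L(f_k^*) = L_k^*$. Then, for all $n \ge 1$ we have

$$\mathbb{P}\Big\{L(\tilde f) - L^* > \inf_k\,(L_k^* - L^* + 2\hat C_k)\Big\} \le \frac{4\gamma}{n^2}$$

and the asymptotic almost-sure bound

$$\mathbb{P}\Big\{\liminf_{n\to\infty}\Big\{L(\tilde f) - L^* \le \inf_k\,(L_k^* - L^* + 2\hat C_k)\Big\}\Big\} = 1.$$

Proof. Let $\hat k$ be the selected model index. Notice that

$$\begin{aligned}
L(\tilde f) &= \hat L(\tilde f) + \hat C_{\hat k} + (L-\hat L)(\tilde f) - \hat C_{\hat k} \\
&\le \inf_k\,[\hat L(\hat f_k) + \hat C_k] + \sup_k\,[(L-\hat L)(\hat f_k) - \hat C_k] \\
&\le \inf_k\,[\hat L(f_k^*) + \hat C_k] + \sup_k\,[(L-\hat L)(\hat f_k) - \hat C_k] \\
&\le \inf_k\,[L_k^* + 2\hat C_k] + \sup_k\,[(\hat L-L)(f_k^*) - \hat C_k] + \sup_k\,[(L-\hat L)(\hat f_k) - \hat C_k].
\end{aligned}$$

By assumption, the last two terms on the right-hand side satisfy

$$\mathbb{P}\Big\{\sup_k\,[(\hat L-L)(f_k^*) - \hat C_k] + \sup_k\,[(L-\hat L)(\hat f_k) - \hat C_k] > 0\Big\} \le \sum_{k=1}^{\infty}\frac{2\gamma}{n^2k^2} \le \frac{4\gamma}{n^2},$$

proving the first inequality. The almost-sure statement is a direct
consequence of the Borel-Cantelli lemma. □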

3. A simple version. The purpose of this short section is to offer a
simplified yet suggestive illustration of the ideas. As discussed in the
Introduction, an ideal penalty would be a tight upper bound for the expression on
the right-hand side of (1.4). Motivated by this bound, we propose the simple
penalty

$$\hat C_k = 2\sqrt{2\hat L(\hat f_k) + 8\,\frac{\log\mathbb{S}_k(2n) + 2\log(nk)}{n}}\ \sqrt{\frac{\log\mathbb{S}_k(2n)}{n} + \frac{2\log(nk)}{n}},$$

where $\mathbb{S}_k(2n)$ is the (worst-case) $2n$-shatter coefficient defined in (1.1). Thus,
the minimal loss $L_k^*$ in class $\mathcal{F}_k$ is estimated by its natural empirical
counterpart $\hat L(\hat f_k) = \inf_{f\in\mathcal{F}_k}\hat L(f)$ and the expected logarithmic shatter coefficient
$\mathbb{E}\log\mathbb{S}_k(X_1^n)$ is estimated by the distribution-free upper bound $\log\mathbb{S}_k(2n)$.
[This term may be bounded further by $V_k\log(2n+1)$, where $V_k$ is the
VC-dimension of the set $\mathcal{A}_k$.] The auxiliary terms $n^{-1}\log(nk)$ are
necessary to derive the desired oracle inequalities. The next theorem shows that
the proposed penalty indeed works.

Theorem 3.1. Consider the penalized empirical loss minimizer $\tilde f$ with
the data-based penalty $\hat C_k$ defined above. Then, for every $n$ and for all
distributions of $(X,Y)$,

$$\mathbb{E}L(\tilde f) - L^* \le \inf_k\,(L_k^* - L^* + \mathbb{E}\hat C_k) + \frac{16}{n^2}.$$

In particular,

$$\mathbb{E}L(\tilde f) - L^* \le \inf_k\left[L_k^* - L^* + 2\sqrt{2L_k^* + \frac{8}{n}\{\log\mathbb{S}_k(2n) + 2\log(nk)\}} \times \sqrt{\frac{\log\mathbb{S}_k(2n)}{n} + \frac{2\log(nk)}{n}}\right] + \frac{16}{n^2}.$$

The proof uses Lemma 2.1 and the following uniform deviation bound due
to Vapnik and Chervonenkis [27]. (The slightly improved form used here is
proved by Anthony and Shawe-Taylor [1].)

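Proposition 3.2. Let $\mathbb{S}_k(X_1^{2n})$ be the random shatter coefficient of $\mathcal{A}_k$
based on i.i.d. observations $X_1,\ldots,X_{2n}$ defined in (1.3). For all $\varepsilon > 0$ and
$n \ge 1$,

$$\mathbb{P}\Big\{\sup_{f\in\mathcal{F}_k} L(f) - 2\hat L(f) > 2\varepsilon\Big\} \le 4\,\mathbb{E}\mathbb{S}_k(X_1^{2n})\exp(-n\varepsilon/4) \tag{3.1}$$

and

$$\mathbb{P}\Big\{\sup_{f\in\mathcal{F}_k} \hat L(f) - 2L(f) > 2\varepsilon\Big\} \le 4\,\mathbb{E}\mathbb{S}_k(X_1^{2n})\exp(-n\varepsilon/4). \tag{3.2}$$

Proof. Observe that, for all $\varepsilon > 0$ and $n \ge 1$,

$$\Big\{\sup_{f\in\mathcal{F}_k} L(f) - 2\hat L(f) > 2\varepsilon\Big\} \subset \Big\{\sup_{f\in\mathcal{F}_k}\frac{L(f)-\hat L(f)}{\sqrt{L(f)}} > \sqrt{\varepsilon}\Big\},$$

and similarly,

$$\Big\{\sup_{f\in\mathcal{F}_k} \hat L(f) - 2L(f) > 2\varepsilon\Big\} \subset \Big\{\sup_{f\in\mathcal{F}_k}\frac{\hat L(f)-L(f)}{\sqrt{\hat L(f)}} > \sqrt{\varepsilon}\Big\}.$$

The proposition follows by [1]. □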

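Proof of Theorem 3.1. We start with the proof of the first inequality
of Theorem 3.1. In view of Lemma 2.1, it suffices to show that

$$\mathbb{P}\{L(\hat f_k) - \hat L(\hat f_k) > \hat C_k\} \le 8/(nk)^2.$$

By Proposition 3.2,

$$\begin{aligned}
\mathbb{P}\Big\{2\hat L(\hat f_k) + 8\,\frac{\log\mathbb{S}_k(2n)}{n} + 16\,\frac{\log(nk)}{n} < L(\hat f_k)\Big\}
&= \mathbb{P}\Big\{L(\hat f_k) - 2\hat L(\hat f_k) > 8\,\frac{\log\mathbb{S}_k(2n)}{n} + 16\,\frac{\log(nk)}{n}\Big\} \\
&\le 4\,\mathbb{S}_k(2n)\exp\Big(-\frac{n}{8}\Big(8\,\frac{\log\mathbb{S}_k(2n)}{n} + 16\,\frac{\log(nk)}{n}\Big)\Big) \\
&= \frac{4}{n^2k^2},
\end{aligned}$$

so that

$$\mathbb{P}\{\hat C_k \ge \bar C_k\} \ge 1 - 4/(nk)^2,$$

where

$$\bar C_k = 2\sqrt{L(\hat f_k)}\ \sqrt{\frac{\log\mathbb{S}_k(2n)}{n} + 2\,\frac{\log(nk)}{n}}.$$

Another application of Proposition 3.2 (through the inequality of [1] used in its proof) yields

$$\begin{aligned}
\mathbb{P}\{L(\hat f_k) - \hat L(\hat f_k) > \hat C_k\}
&\le \mathbb{P}\{L(\hat f_k) - \hat L(\hat f_k) > \bar C_k\} + \frac{4}{(nk)^2} \\
&\le \mathbb{P}\Big\{\sup_{f\in\mathcal{F}_k}\frac{L(f)-\hat L(f)}{\sqrt{L(f)}} > 2\sqrt{\frac{\log\mathbb{S}_k(2n)}{n} + 2\,\frac{\log(nk)}{n}}\Big\} + \frac{4}{(nk)^2} \\
&\le \frac{8}{(nk)^2}.
\end{aligned}$$

Conclude via Lemma 2.1 that

$$\mathbb{E}L(\tilde f) \le \min_k\,(L_k^* + \mathbb{E}\hat C_k) + \frac{16}{n^2}.$$

For the second inequality, deduce that for all $\delta > 0$,

$$\mathbb{E}\sqrt{\hat L(\hat f_k) + \delta} \le \sqrt{\mathbb{E}\hat L(\hat f_k) + \delta} \le \sqrt{\mathbb{E}\inf_{f\in\mathcal{F}_k}\hat L(f) + \delta} \le \sqrt{L_k^* + \delta},$$

by Jensen's inequality and the definition of $\hat f_k$. □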

The bound of Theorem 3.1 has the right dependence on $L_k^*$ as suggested by
inequality (1.4) mentioned in the Introduction. In particular, if $L_k^*$ happens
to equal zero for some class $\mathcal{F}_k$, then the upper bound has an improved
rate of convergence. The disadvantage of the simple penalty defined above
is that instead of the expected shatter coefficients, a distribution-free (and
therefore suboptimal) upper bound appears for each class $\mathcal{F}_k$.

Recently, Boucheron, Lugosi and Massart [4] proved that $\log\mathbb{S}_k(X_1^n)$
concentrates sharply around its mean. For example, we have the following
inequalities.

Proposition 3.3. For all $\varepsilon > 0$, $n \ge 1$,

$$\mathbb{P}[\mathbb{E}\log\mathbb{S}_k(X_1^n) > 2\log\mathbb{S}_k(X_1^n) + 2\varepsilon] \le e^{-\varepsilon},$$
$$\mathbb{P}[\log\mathbb{S}_k(X_1^n) > 2\,\mathbb{E}\log\mathbb{S}_k(X_1^n) + 2\varepsilon] \le e^{-\varepsilon}.$$

Moreover, for each $n \ge 1$,

$$\mathbb{E}\log\mathbb{S}_k(X_1^n) \le \log\mathbb{E}\mathbb{S}_k(X_1^n) \le \frac{1}{\ln 2}\,\mathbb{E}\log\mathbb{S}_k(X_1^n) \le 2\,\mathbb{E}\log\mathbb{S}_k(X_1^n).$$

This proposition implies that the expected random log shatter coefficients
$\mathbb{E}\log\mathbb{S}_k(X_1^n)$ of $\mathcal{F}_k$ may be replaced by a constant times $\log\mathbb{S}_k(X_1^n)$ and
vice versa. Hence we may replace the distribution-free bounds $\log\mathbb{S}_k(2n)$
by empirical estimates $\log\mathbb{S}_k(X_1^n)$, at the price of slightly worse constants.
The main oracle inequalities in Section 4 are accompanied by asymptotic
almost-sure versions of bounds for the expected value. Such bounds are easy
to obtain as well, simply by invoking Lemma 2.2 instead of Lemma 2.1. The
details are omitted here.



4. Rademacher penalties. The main results of the paper are presented
in this section. Assign to each model class $\mathcal{F}_k$,

$$\hat u_k = 16\cdot\frac{4\log\mathbb{S}_k(X_1^n) + 9\log(nk)}{n}, \tag{4.1}$$

with $\mathbb{S}_k(X_1^n)$ defined in (1.3), and the class

$$\hat{\mathcal{F}}_k = \{f\in\mathcal{F}_k:\ \hat L(f) \le 16\hat L(\hat f_k) + 15\hat u_k\}. \tag{4.2}$$

Observe that the class $\hat{\mathcal{F}}_k$ contains only those classifiers whose empirical
loss is not much larger than that of the empirical minimizer. Note that
the constant 16 has no special role; it has been chosen by convenience. Any
constant larger than 1 would lead to similar results, at the price of modifying
other constants. The term $\hat u_k$ depends on the shatter coefficient of the whole
class $\mathcal{F}_k$ but it is typically small compared to $\hat L(\hat f_k)$.

The penalty is calculated in terms of the Rademacher average of this
smaller class. More precisely, define the complexity estimate by

$$\hat C_k = \Big(8\hat R_{\hat{\mathcal{F}}_k} + 20n^{-1}\log(nk) + 2\sqrt{n^{-1}\log(nk)}\cdot\sqrt{8\hat L(\hat f_k) + 7\hat u_k}\Big)\wedge 1. \tag{4.3}$$

Again, not too much attention should be paid to the values of the
constants involved. We favored simple readable proofs over optimal constants.
Note that, through $\mathbb{S}_k(X_1^n)$, the penalty also depends on the random shatter
coefficient of the whole class $\mathcal{F}_k$. However, the term involving the shatter
coefficient of the entire class $\mathcal{F}_k$,

$$n^{-1}\sqrt{\log(nk)\cdot\log\mathbb{S}_k(X_1^n)},$$

is typically much smaller (by a factor $n^{-1/2}$) than the Rademacher average
of the whole class $\mathcal{F}_k$. [For instance, see (4.8) and Proposition 4.6.]
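
Definitions (4.1) to (4.3) can be combined into a short computation. The sketch below is an illustration under several assumptions: the class $\mathcal{F}_k$ is represented by the finite set of 0/1 loss patterns it realizes on the sample (so the number of distinct patterns plays the role of $\mathbb{S}_k(X_1^n)$), and the Rademacher average of the localized class is approximated by Monte Carlo rather than computed exactly. The function name and its return convention are hypothetical.

```python
import math
import random


def localized_rademacher_penalty(loss_vectors, k, num_draws=1000, seed=0):
    """Return (penalty, u_hat, size of localized class) following (4.1)-(4.3)."""
    rng = random.Random(seed)
    n = len(loss_vectors[0])
    patterns = sorted(set(map(tuple, loss_vectors)))
    emp = {v: sum(v) / n for v in patterns}
    min_loss = min(emp.values())  # empirical loss of the empirical minimizer
    log_nk = math.log(n * k)

    # (4.1): scale term based on the random shatter coefficient.
    u_hat = 16 * (4 * math.log(len(patterns)) + 9 * log_nk) / n

    # (4.2): keep only patterns with small empirical loss.
    local = [v for v in patterns if emp[v] <= 16 * min_loss + 15 * u_hat]

    # Monte Carlo estimate of the Rademacher average of the localized class.
    total = 0.0
    for _ in range(num_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(s * l for s, l in zip(sigma, v)) for v in local) / n
    rad = total / num_draws

    # (4.3): penalty, truncated at 1.
    penalty = (8 * rad + 20 * log_nk / n
               + 2 * math.sqrt(log_nk / n) * math.sqrt(8 * min_loss + 7 * u_hat))
    return min(penalty, 1.0), u_hat, len(local)
```

With the constants as stated, the truncation at 1 is active unless $n$ is large, which is consistent with the remark that the constants were chosen for readable proofs rather than for practical use.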

We have the following performance bound for the expected loss of the
minimizer $\tilde f$ of the penalized empirical loss $\hat L(\hat f_k) + \hat C_k$.

Theorem 4.1. For every $n$,

$$\mathbb{E}L(\tilde f) - L^* \le \inf_k\,(L_k^* - L^* + \mathbb{E}\hat C_k) + \frac{22}{n^2}.$$

In addition, with probability greater than $1 - 44/n^2$,

$$L(\tilde f) - L^* \le \inf_k\,(L_k^* - L^* + 2\hat C_k),$$

and also

$$\mathbb{P}\Big\{\liminf_{n\to\infty}\Big\{L(\tilde f) - L^* \le \inf_k\,(L_k^* - L^* + 2\hat C_k)\Big\}\Big\} = 1.$$

The next theorem shows that the bound above is indeed a
significant improvement over bounds of the type (1.2), and that the
dependence on the minimal loss $L_k^*$ and the random shatter coefficient has the
form suggested by (1.4). For this purpose, we introduce

$$u_k = 16\cdot\frac{8\,\mathbb{E}\log\mathbb{S}_k(X_1^n) + 17\log(nk)}{n} \tag{4.4}$$

and the class

$$\bar{\mathcal{F}}_k = \{f\in\mathcal{F}_k:\ L(f) \le 64L_k^* + 63u_k\}.$$

We also set

$$e_k = 2n^{-1}\log(nk).$$

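Theorem 4.2. The following oracle inequality holds:

$$\mathbb{E}L(\tilde f) - L^* \le \min_{k\ge 1}\Big\{L_k^* - L^* + 8\,\mathbb{E}\hat R_{\bar{\mathcal{F}}_k} + 15e_k + 16\sqrt{L_k^* + u_k}\cdot\sqrt{2e_k}\Big\} + 22n^{-2}.$$

In particular, there exist universal constants $\gamma_1$ and $\gamma_2$ such that

$$\mathbb{E}L(\tilde f) - L^* \le \inf_k\left[L_k^* - L^* + \gamma_1\sqrt{\frac{L_k^*\cdot(\mathbb{E}\log\mathbb{S}_k(X_1^n)\vee\log(nk))}{n}} + \gamma_2\,\frac{\mathbb{E}\log\mathbb{S}_k(X_1^n)\vee\log(nk)}{n}\right].$$

This oracle inequality has the desired form outlined in the Introduction and
improves upon the results of [2] and [13]. For example, in the special case
when $L_k^* = 0$ for $k \ge k_0$, we obtain, for some numerical constants $c_1$ and $c_2$,

$$\mathbb{E}L(\tilde f) \le \min_{k\ge k_0} c_1\,\frac{\mathbb{E}\log\mathbb{S}_k(X_1^n)\vee\log(nk)}{n} + \frac{c_2}{n^2},$$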
which is of a different order of magnitude from the penalties considered
by [2] and [13]. Theorem 4.2 is only stated for the expected loss but an
inequality which holds with "large" probability may be obtained just as in
Theorem 4.1.

Proofs of Theorems 4.1 and 4.2. First, recall the definitions of $\hat u_k$ and
$u_k$ in (4.1) and (4.4), respectively, and in addition define

$$\bar u_k = 8\cdot\frac{2\log\mathbb{E}\mathbb{S}_k(X_1^n) + 2\log(nk)}{n}$$

and the event

$$B_k = \{\bar u_k \le \hat u_k \le u_k\}.$$

Observe that Proposition 3.3 yields that, with probability at least $1 - 1/(nk)^2$,

$$\begin{aligned}
\bar u_k &= \frac{16}{n}\{\log\mathbb{E}\mathbb{S}_k(X_1^n) + \log(nk)\} \\
&\le \frac{16}{n}\{2\,\mathbb{E}\log\mathbb{S}_k(X_1^n) + \log(nk)\} \\
&\le \frac{16}{n}\{2[2\log\mathbb{S}_k(X_1^n) + 4\log(nk)] + \log(nk)\} \\
&= \hat u_k \\
&\le \frac{16}{n}\{4[2\,\mathbb{E}\log\mathbb{S}_k(X_1^n) + 4\log(nk)] + 9\log(nk)\} \\
&= \frac{16}{n}\{8\,\mathbb{E}\log\mathbb{S}_k(X_1^n) + 17\log(nk)\} \\
&= u_k
\end{aligned}$$

and therefore

$$\mathbb{P}B_k^c \le (nk)^{-2}. \tag{4.5}$$

Finally, we introduce the event

$$A_k = \Big\{\sup_{f\in\mathcal{F}_k} L(f) - 2\hat L(f) \le \bar u_k\Big\} \cap \Big\{\sup_{f\in\mathcal{F}_k} \hat L(f) - 2L(f) \le \bar u_k\Big\}$$

and the class

$$\mathcal{F}_k^* = \{f\in\mathcal{F}_k:\ L(f) \le 4L_k^* + 3\bar u_k\}.$$

The following intermediate result will be useful in the proofs of both
theorems.

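Lemma 4.3. We have

$$\mathbb{P}\{A_k\cap B_k\} \ge 1 - \frac{9}{n^2k^2} \tag{4.6}$$

and on the set $A_k\cap B_k$ the following hold:

(i) $\hat f_k \in \mathcal{F}_k^*$.

(ii) $\mathcal{F}_k^* \subseteq \hat{\mathcal{F}}_k$, and in particular, $\hat R_{\mathcal{F}_k^*} \le \hat R_{\hat{\mathcal{F}}_k}$.

(iii) $L_k^* \le 2\hat L(\hat f_k) + \bar u_k$.

Proof. To begin with, notice that

$$\mathbb{E}\mathbb{S}_k(X_1^{2n}) \le \mathbb{E}\,\mathbb{S}_k(X_1^n)\,\mathbb{S}_k(X_{n+1}^{2n}) = \mathbb{E}^2\mathbb{S}_k(X_1^n)$$

by the definition of the shatter coefficient and by the independence of the $X_i$.
Thus, by Proposition 3.2,

$$\mathbb{P}A_k^c \le 8\,\mathbb{E}\mathbb{S}_k(X_1^{2n})\exp\Big(-\frac{n\bar u_k}{8}\Big) \le \frac{8}{n^2k^2}.$$

This bound and (4.5) imply assertion (4.6). To prove claim (i), observe that
on $A_k$,

$$\begin{aligned}
L(\hat f_k) &\le 2\hat L(\hat f_k) + \bar u_k && (\text{by definition of } A_k) \\
&\le 2\hat L(f_k^*) + \bar u_k && (\text{by definition of } \hat f_k) \\
&\le 2(2L_k^* + \bar u_k) + \bar u_k && (\text{by definition of } A_k) \\
&= 4L_k^* + 3\bar u_k.
\end{aligned}$$

For claim (ii), notice that, for any $f\in\mathcal{F}_k^*$,

$$\begin{aligned}
\hat L(f) &\le 2L(f) + \bar u_k && (\text{by definition of } A_k) \\
&\le 2[4L_k^* + 3\bar u_k] + \bar u_k && (\text{by definition of } \mathcal{F}_k^*) \\
&= 8L_k^* + 7\bar u_k \\
&\le 8L(\hat f_k) + 7\bar u_k && (\text{by definition of } L_k^*) \\
&\le 16\hat L(\hat f_k) + 15\bar u_k && (\text{by definition of } A_k) \\
&\le 16\hat L(\hat f_k) + 15\hat u_k && (\text{by definition of } B_k).
\end{aligned}$$

Claim (ii) now follows. Claim (iii) is immediate from the definition of $A_k$
since both $\hat f_k$ and $f_k^*$ belong to $\mathcal{F}_k$. □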

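Next we link the Rademacher average $\hat R_{\mathcal{F}_k^*}$ to $\mathbb{E}\sup_{f\in\mathcal{F}_k^*}|L(f) - \hat L(f)|$.
By a classical symmetrization device (cf. [10] or [25]),

$$\mathbb{E}\sup_{f\in\mathcal{F}_k^*}|L(f) - \hat L(f)| \le 2\,\mathbb{E}\hat R_{\mathcal{F}_k^*}. \tag{4.7}$$

Also, $\hat R_{\mathcal{F}_k^*}$ is known to concentrate sharply around its mean. For example,
we have, by results of [4, 5], the following bounds.

Proposition 4.4. For all $\varepsilon > 0$, $n \ge 1$,

$$\mathbb{P}[\hat R_{\mathcal{F}_k} > 2\,\mathbb{E}\hat R_{\mathcal{F}_k} + \varepsilon] \le e^{-6n\varepsilon/5} \quad\text{and}\quad \mathbb{P}[\hat R_{\mathcal{F}_k} < \tfrac{1}{2}\mathbb{E}\hat R_{\mathcal{F}_k} - \varepsilon] \le e^{-n\varepsilon}.$$

Proof. Define $Z = n\hat R_{\mathcal{F}_k}$. Then it follows from [4] that

$$\log\mathbb{E}\exp(\lambda(Z - \mathbb{E}Z)) \le \mathbb{E}Z\,(e^{\lambda} - 1 - \lambda),$$

which implies further that, for $0 < \lambda < 3$,

$$\log\mathbb{E}\exp(\lambda(Z - \mathbb{E}Z)) \le \frac{\lambda^2\,\mathbb{E}Z}{2(1-\lambda/3)}.$$

After an application of Markov's inequality, we find

$$\mathbb{P}[Z \ge \mathbb{E}Z + \sqrt{2\,\mathbb{E}Z\,x} + x/3] \le e^{-x}.$$

We obtain the desired upper-tail bound by inserting $Z = n\hat R_{\mathcal{F}_k}$ in the
preceding display and invoking the inequality $2\sqrt{xy} \le x + y$. The bound for the
lower tail follows from the inequality

$$\mathbb{P}[Z \le \mathbb{E}Z - \sqrt{2x\,\mathbb{E}Z}] \le e^{-x}$$

(see [4]) and since $x + \tfrac{1}{2}y \ge \sqrt{2xy}$. □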
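
For completeness, the two substitutions behind the stated exponents can be written out as follows. This is only a sketch of the arithmetic, taking the two tail inequalities quoted in the proof as given and writing $Z = n\hat R_{\mathcal{F}_k}$.

```latex
% Upper tail: 2\sqrt{ab} \le a + b with a = \mathbb{E}Z and b = x/2 gives
% \sqrt{2x\,\mathbb{E}Z} \le \mathbb{E}Z + x/2, hence
\mathbb{P}\Big[Z \ge 2\,\mathbb{E}Z + \tfrac{5x}{6}\Big]
  \le \mathbb{P}\Big[Z \ge \mathbb{E}Z + \sqrt{2\,\mathbb{E}Z\,x} + \tfrac{x}{3}\Big]
  \le e^{-x},
\qquad \text{and } \tfrac{5x}{6} = n\varepsilon \iff x = \tfrac{6n\varepsilon}{5}.

% Lower tail: \sqrt{2xy} \le x + y/2 with y = \mathbb{E}Z gives
\mathbb{P}\Big[Z \le \tfrac{1}{2}\mathbb{E}Z - x\Big]
  \le \mathbb{P}\Big[Z \le \mathbb{E}Z - \sqrt{2x\,\mathbb{E}Z}\Big]
  \le e^{-x},
\qquad \text{and } x = n\varepsilon.
```

Dividing the events by $n$ turns both statements into the bounds of Proposition 4.4.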

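Finally, we make key use of the following concentration inequality for the
supremum of an empirical process, recently established by Talagrand [23];
see also [14, 19, 21]. The best-known constants reported here have been
obtained by Bousquet [6].

Proposition 4.5. Set $\Sigma^2_{\mathcal{F}_k^*} = \sup_{f\in\mathcal{F}_k^*} L(f)(1 - L(f))$. For all $\varepsilon > 0$,
$n \ge 1$,

$$\mathbb{P}\Big\{\sup_{f\in\mathcal{F}_k^*}|L(f) - \hat L(f)| > 2\,\mathbb{E}\sup_{f\in\mathcal{F}_k^*}|L(f) - \hat L(f)| + \Sigma_{\mathcal{F}_k^*}\sqrt{2\varepsilon} + \frac{4\varepsilon}{3}\Big\} \le e^{-n\varepsilon}.$$

We are now ready to prove Theorems 4.1 and 4.2.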

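Proof of Theorem 4.1. Deduce, using (i), (ii) and (iii) of Lemma
4.3, the following string of inequalities:

$$\begin{aligned}
&\mathbb{P}[\{L(\hat f_k) > \hat L(\hat f_k) + \hat C_k\}\cap A_k\cap B_k] \\
&\quad= \mathbb{P}[\{L(\hat f_k) > \hat L(\hat f_k) + 8\hat R_{\hat{\mathcal{F}}_k} + 10e_k + \sqrt{8\hat L(\hat f_k) + 7\hat u_k}\,\sqrt{2e_k}\}\cap A_k\cap B_k] \\
&\quad\le \mathbb{P}[\{\exists f\in\mathcal{F}_k^*:\ L(f) > \hat L(f) + 8\hat R_{\hat{\mathcal{F}}_k} + 10e_k + \sqrt{8\hat L(\hat f_k) + 7\hat u_k}\,\sqrt{2e_k}\}\cap A_k\cap B_k] \\
&\qquad [\text{by property (i)}] \\
&\quad\le \mathbb{P}[\{\exists f\in\mathcal{F}_k^*:\ L(f) > \hat L(f) + 8\hat R_{\mathcal{F}_k^*} + 10e_k + \sqrt{8\hat L(\hat f_k) + 7\bar u_k}\,\sqrt{2e_k}\}\cap A_k\cap B_k] \\
&\qquad [\text{by property (ii) and definition of } B_k] \\
&\quad\le \mathbb{P}[\{\exists f\in\mathcal{F}_k^*:\ L(f) > \hat L(f) + 8\hat R_{\mathcal{F}_k^*} + 10e_k + \Sigma_{\mathcal{F}_k^*}\sqrt{2e_k}\}] \\
&\qquad [\text{by property (iii)}] \\
&\quad\le \mathbb{P}\Big(\sup_{f\in\mathcal{F}_k^*}|L(f) - \hat L(f)| > 8\hat R_{\mathcal{F}_k^*} + 10e_k + \Sigma_{\mathcal{F}_k^*}\sqrt{2e_k}\Big),
\end{aligned}$$

where the step using property (iii) follows from

$$\Sigma^2_{\mathcal{F}_k^*} = \sup_{f\in\mathcal{F}_k^*}\mathrm{Var}(\mathbb{I}\{f(X)\neq Y\}) \le \sup_{f\in\mathcal{F}_k^*}L(f) \le 4L_k^* + 3\bar u_k.$$

Invoke (4.7), (4.6) and Propositions 4.4 and 4.5 to conclude that

$$\begin{aligned}
&\mathbb{P}\{L(\hat f_k) > \hat L(\hat f_k) + \hat C_k\} \\
&\quad\le \mathbb{P}\Big(\sup_{f\in\mathcal{F}_k^*}|L(f) - \hat L(f)| > 8\hat R_{\mathcal{F}_k^*} + 10e_k + \Sigma_{\mathcal{F}_k^*}\sqrt{2e_k}\Big) + \frac{9}{n^2k^2} \\
&\qquad [\text{since } \mathbb{P}(A_k\cap B_k)^c \le 9/(n^2k^2) \text{ by (4.6) in Lemma 4.3}] \\
&\quad\le \mathbb{P}\Big(\sup_{f\in\mathcal{F}_k^*}|L(f) - \hat L(f)| > 4\,\mathbb{E}\hat R_{\mathcal{F}_k^*} + 2e_k + \Sigma_{\mathcal{F}_k^*}\sqrt{2e_k}\Big) + \frac{10}{n^2k^2} \\
&\qquad (\text{by Proposition 4.4}) \\
&\quad\le \mathbb{P}\Big(\sup_{f\in\mathcal{F}_k^*}|L(f) - \hat L(f)| > 2\,\mathbb{E}\sup_{f\in\mathcal{F}_k^*}|L(f) - \hat L(f)| + \frac{4e_k}{3} + \Sigma_{\mathcal{F}_k^*}\sqrt{2e_k}\Big) + \frac{10}{n^2k^2} \\
&\qquad [\text{by (4.7)}] \\
&\quad\le \frac{11}{n^2k^2} \qquad (\text{by Proposition 4.5}).
\end{aligned}$$

This inequality and Lemma 2.1 imply the first assertion of the theorem.
The other statements (the probability bound and the almost-sure statement)
follow by invoking Lemma 2.2 and the preceding argument, which also shows
that

$$\mathbb{P}\{\hat C_k \le (\hat L - L)(f_k^*)\} \le \frac{11}{n^2k^2},$$

although the last assertion could be shown in a much easier way as it only
involves a single function $f_k^*$. The proof of Theorem 4.1 is complete. □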

In the proof of Theorem 4.2 we need the symmetrization device

$$\mathbb{E}\hat R_{\mathcal{F}} \le 2\,\mathbb{E}\sup_{f\in\mathcal{F}}|L(f) - \hat L(f)| + \frac{\sup_{f\in\mathcal{F}}L(f)}{\sqrt{n}} \tag{4.8}$$

(see, e.g., [20], page 18), and also the following result due to Massart [18].
(The version stated here is taken from [16].)

Proposition 4.6. Set $\Sigma_k = \sup_{f\in\mathcal{F}_k}\sqrt{L(f)(1 - L(f))}$. Then, for all
$n \ge 1$,



e sup m - l ( /)i < «"°«g»w> + 4 Mlw£). 

f&H n V n 

Proof. The statement follows almost immediately from Theorem 1.10 
in [16] by noting that the worst-case shatter coefficients may be replaced 
with impunity by the random shatter coefficients. □ 

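Proof of Theorem 4.2. Observe that on the event $A_k\cap B_k$, $\hat{\mathcal{F}}_k \subseteq \bar{\mathcal{F}}_k$,
where $\bar{\mathcal{F}}_k$ is as defined in Theorem 4.2. Indeed, for any $f\in\hat{\mathcal{F}}_k$,

$$\begin{aligned}
L(f) &\le 2\hat L(f) + \bar u_k && (\text{by definition of } A_k) \\
&\le 2[16\hat L(\hat f_k) + 15\hat u_k] + \bar u_k && (\text{by definition of } \hat{\mathcal{F}}_k) \\
&\le 32\hat L(\hat f_k) + 31u_k && (\text{by definition of } B_k) \\
&\le 32\hat L(f_k^*) + 31u_k && (\text{by definition of } \hat f_k) \\
&\le 32[2L_k^* + \bar u_k] + 31u_k && (\text{by definition of } A_k) \\
&\le 64L_k^* + 63u_k.
\end{aligned}$$

Also, we notice that on the event $A_k$,

$$\hat L(\hat f_k) \le \hat L(f_k^*) \le 2L_k^* + \bar u_k.$$

These observations imply that

$$\begin{aligned}
\hat C_k\,\mathbb{I}_{A_k\cap B_k} &\le 8\hat R_{\bar{\mathcal{F}}_k} + 10e_k + 2\sqrt{64L_k^* + 63u_k}\,\sqrt{2e_k} \\
&\le 8\hat R_{\bar{\mathcal{F}}_k} + 10e_k + 16\sqrt{L_k^* + u_k}\,\sqrt{2e_k}.
\end{aligned}$$

Consequently, it follows from Lemma 4.3 that

$$\begin{aligned}
\mathbb{E}\hat C_k &\le \mathbb{E}\hat C_k\,\mathbb{I}_{A_k\cap B_k} + \mathbb{P}(A_k\cap B_k)^c \\
&\le 8\,\mathbb{E}\hat R_{\bar{\mathcal{F}}_k} + 10e_k + 16\sqrt{L_k^* + u_k}\,\sqrt{2e_k} + 9(nk)^{-2}.
\end{aligned}$$

This bound and Theorem 4.1 yield the first inequality of Theorem 4.2. The
second inequality follows from the symmetrization (4.8) and Proposition 4.6.
□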